Thoughts and Theory

Model-Agnostic Local Explanation Models From a Statistical Viewpoint I

If you do not understand the mathematical expressions behind local explanation methods, their attributions can seem counterintuitive. This blog post series provides an introduction to the most widely used local explanation methods and discusses their locality.

Sahra Ghalebikesabi
Towards Data Science
6 min read · Jul 1, 2021

--

1 Introduction

In a recent paper (Ghalebikesabi et al., 2021), we showed that local explanation models are not as local as one might assume. As more open-source implementations of model explainability tools become available, it becomes increasingly important to understand what these methods are actually doing and why their local interpretation is limited. In this blog post series, we focus on the statistical analysis and discussion of four local model-agnostic explainability tools:

  • Tangent Line Approximation
  • LIME
  • SHAP
  • Neighbourhood SHAP

If you already know what model explainability, interpretable AI, or XAI is, wait until Part 2!

Introduction to Model Explainability

So what are model explanations? Model explanations can be loosely described as explanations of a black-box model. Typically, they provide the user with feature attributions, i.e. importance scores for each feature. Since “explaining a model” could mean anything, a better question to ask is which questions explanation models answer. There are different reasons why a user might be interested in explaining a black box, and we cover some of them now.

  • Understanding. First of all, a user might be interested in why a black box makes a certain prediction. Why is a CNN classifying an image of a Husky as a wolf?
  • Trust. Explanations can make the user trust or distrust the model. Ribeiro et al. (2016) show that users are less likely to trust a black box model if the explanations are counter-intuitive as in the Husky example.
  • Feature Selection. Model explainability tools can also be used to select a subset of features to prevent overfitting.
  • Actionable Advice. Model explanations in the form of feature attributions can be used to advise the user on what to change to receive a different black-box response. For a credit applicant who has been denied credit, a negative attribution for their high debt is a hint that decreasing the debt can lead to credit approval.

Depending on the question the data analyst is interested in answering, a different explanation model should be used. Explanation models can be classified along the following dimensions (see Tanner (2019)):

Image copied from Ribeiro et al. (2016).

  • Analysis Stage. The explanation model can be intrinsic or post-hoc. Intrinsic explanation models are simple models, such as rule-based models or additive functions (e.g. in the form of a GAM), that replace the black box and make the predictions themselves. Post-hoc explanation models are used to explain a black-box model after it has been trained.
  • Model Specificity. Some explanation models are specific to the underlying black box. For instance, saliency maps work only on differentiable functions, while model-agnostic interpretability tools can be applied to any model.
  • Locality. Last but not least, explainability tools can be differentiated based on their locality. While some of them explain the local model behaviour at an instance of interest x, others explain the global behaviour of the black box.

In this blog post series, we focus on local model-agnostic explanation models used for post-hoc interpretability.

2 Tangent Approximations

What is the easiest way to explain any model locally? One of the first concepts learned in any analysis class is the first-order Taylor approximation: a black-box function f(x) can be explained at an instance of interest x by fitting a tangent line around x. As we only assume black-box access to the model, this tangent line has to be approximated by evaluating the model in a small neighbourhood around x, as is done in locally linear kernel regression. Examples of tangent approximations for tabular data sets are MAPLE (Plumb et al., 2018) and MeLIME (Botari et al., 2020). LIME, as presented by Ribeiro et al. (2016), was formulated as a generalisation of tangent line approximations.
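To make this concrete, here is a minimal sketch of the idea under simplifying assumptions: the black box f below is a made-up toy function, and the tangent is estimated with central finite differences, a simpler stand-in for the locally linear kernel regression used by the methods above.

```python
import numpy as np

def tangent_slopes(f, x, eps=1e-3):
    """Estimate the slopes of the tangent plane of a black box f at x
    by evaluating f at small perturbations of x (central differences)."""
    x = np.asarray(x, dtype=float)
    grads = np.empty_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grads[j] = (f(x + step) - f(x - step)) / (2 * eps)
    return grads

# Toy black box: we only get to evaluate it, not inspect it.
f = lambda x: np.sin(x[0]) + x[1] ** 2
print(tangent_slopes(f, x=[0.0, 2.0]))  # roughly [1., 4.], i.e. [cos(0), 2 * 2]
```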

Image from Ribeiro et al. (2016).

The picture shows

  • a black box classification model f: pink and blue areas
  • an instance being explained: bold red cross
  • instances sampled locally and weighted by their proximity: red crosses and blue circles
  • a locally faithful explanation g: dashed line

According to Ribeiro et al. (2016), for an interpretability model to be locally faithful, it “must correspond to how the model behaves in the vicinity of the instance being predicted”. As we will see later, this definition leaves much room for interpretation! Mathematically, LIME solves the following optimization problem

$\xi(x) = \arg\min_{g \in G} L(f, g, \pi_x) + \Omega(g),$

where G is the family of explanation models (i.e. linear models in the case of tangent approximations) and L is a loss function that ensures that g is fitted to f in a local neighbourhood around x. This neighbourhood is defined by the weighting kernel $\pi_x$, typically an exponential kernel of a distance function (e.g. the Euclidean distance) with a fixed bandwidth $\sigma$, that is $\pi_x(x^*)=\exp(-\|x^*-x\|/\sigma)$. Finally, $\Omega(g)$ is a penalty term, e.g. an $\ell_1$ penalty, that prevents overfitting and encourages the simplicity of g.
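To map the notation onto something executable, the sketch below fits such an explanation under simplifying assumptions: G is the set of (sparse) linear models, L is the $\pi_x$-weighted squared loss evaluated on Gaussian perturbations of x, and $\Omega$ is an $\ell_1$ penalty. The black box f, the sampling scheme, and the bandwidth are illustrative choices, not the setup of any particular paper or library.

```python
import numpy as np
from sklearn.linear_model import Lasso

def explain_locally(f, x, sigma=1.0, n_samples=5000, penalty=0.01, seed=0):
    """Fit a sparse linear surrogate g (an element of G) by minimising the
    pi_x-weighted squared loss L(f, g, pi_x) plus an l1 penalty Omega(g)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Z = x + rng.normal(size=(n_samples, x.size))              # samples around x
    pi_x = np.exp(-np.linalg.norm(Z - x, axis=1) / sigma)     # weighting kernel
    y = np.apply_along_axis(f, 1, Z)                          # black-box responses
    g = Lasso(alpha=penalty).fit(Z, y, sample_weight=pi_x)
    return g.coef_                                            # feature attributions

# Toy black box: feature 3 is irrelevant, so the l1 penalty shrinks its
# attribution to (essentially) zero.
f = lambda z: 3.0 * z[0] - 2.0 * z[1] + 0.0 * z[2]
print(explain_locally(f, x=[0.5, 2.0, 1.0]))  # roughly [ 3., -2.,  0.]
```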

For illustration purposes, let us now consider the black box function given by

which behaves positively in x_2 whenever x_1 is positive, and negatively in x_2 whenever x_1 is non-positive.

Feature Attributions computed with a Local Ridge regression with bandwidth 1 (image by author).

If we fit a tangent line locally at x=(x_1, 2) using the objective above, the attributions of Feature 2 are positive for all x_1>0 and increase the larger x_1 gets, which matches the increasing gradient of the parabola. All in all, tangent line approximations look great, don’t they? Not quite: for negative values of x_1 that are small in absolute value, the attribution of Feature 2 is counter-intuitive. At x_1=-0.1, we see that it is positive even though the local model behaviour would be expected to be negative! This happens because the positive effect of x_2 among some neighbours of x outweighs the negative effect among the others.
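The sketch below illustrates the kind of experiment behind this figure. The black box is an assumed stand-in consistent with the description above (a parabola in x_2 whose sign flips with x_1), and the sampling scheme is a simplifying choice, so the numbers will not match the figure exactly; the point to look for is that attributions close to the boundary x_1 = 0 drift far away from the local slope of the parabola.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Assumed stand-in black box: a parabola in x2 whose sign flips with x1.
def black_box(Z):
    return np.where(Z[:, 0] > 0, Z[:, 1] ** 2, -Z[:, 1] ** 2)

def feature2_attribution(x1, x2=2.0, sigma=1.0, n_samples=20_000, seed=0):
    """Coefficient of feature 2 in a kernel-weighted ridge fit around (x1, x2)."""
    rng = np.random.default_rng(seed)
    x = np.array([x1, x2])
    Z = x + rng.normal(size=(n_samples, 2))                   # samples around x
    weights = np.exp(-np.linalg.norm(Z - x, axis=1) / sigma)  # pi_x, bandwidth 1
    surrogate = Ridge(alpha=1.0).fit(Z, black_box(Z), sample_weight=weights)
    return surrogate.coef_[1]

# Far from the boundary the fit recovers the local slope of the parabola
# (about +4 at x1 = 2, about -4 at x1 = -2). Just left of the boundary
# (x1 = -0.1) the true local slope is still -4, but the many neighbours
# with x1 > 0 pull the fitted attribution far away from it.
for x1 in (2.0, -2.0, -0.1):
    print(f"x1 = {x1:+.1f}: attribution of feature 2 = {feature2_attribution(x1):+.2f}")
```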

So apparently tangent line approximations lose information contained in the black-box model. This is not surprising: we are trying to capture the local model behaviour with a first-order Taylor approximation. If a linear fit were locally adequate, we could have used a locally linear model as the black box from the beginning. See Rudin (2019) for an extensive discussion.

In the next blog post, we dive into Shapley values and explain why SHAP and LIME are not as local as they are often claimed to be!

References

[1] T. Botari, F. Hvilshøj, R. Izbicki, and A. C. de Carvalho (2020). MeLIME: Meaningful local explanation for machine learning models. arXiv preprint arXiv:2009.05818.

[2] S. Ghalebikesabi, L. Ter-Minassian, K. Diaz-Ordaz, and C. Holmes (2021). On Locality of Local Explanation Models. arXiv preprint arXiv:2106.14648.

[3] M. Ribeiro, S. Singh, and C. Guestrin (2016). “Why should I trust you?” Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining.

[4] C. Rudin (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5): 206–215.

[5] G. Plumb, D. Molitor, and A. Talwalkar (2018). Model agnostic supervised local explanations. arXiv preprint arXiv:1807.02910.

[6] G. Tanner (2019). Introduction to Machine Learning Model Interpretation. Blog post at https://gilberttanner.com/blog/introduction-to-machine-learning-model-interpretation
